This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Not all users receive the same offer, and that is the challenge to solve with this data set.
The task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
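The validity period maps onto the clock used in the event data, where times are recorded in hours. As a small sketch with hypothetical values (not taken from the data set):

```python
# Hypothetical example: an offer received at hour 0 with a 5-day validity period.
receipt_time = 0    # hour of the "offer received" event
duration_days = 5   # validity period from the offer metadata

# Event times are in hours, so the offer expires duration_days * 24 hours after receipt.
expiry_time = receipt_time + duration_days * 24
print(expiry_time)  # 120: the offer stops influencing the customer after hour 120
```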
We are given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.
A user could receive a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.
Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
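The completion rule described above can be sketched in a few lines. The offer parameters and purchases here are hypothetical, chosen to mirror the "spend 10 dollars, get 2 dollars off" example:

```python
# Hypothetical "spend 10 dollars, get 2 dollars off" offer received at hour 0,
# valid for 10 days (240 hours).
difficulty = 10.0         # minimum cumulative spend required to complete the offer
validity_hours = 10 * 24  # validity window in hours

# (hour, amount) purchases made by one user; the hour-300 purchase falls outside the window
purchases = [(12, 6.50), (60, 8.50), (300, 4.00)]

spend_in_window = sum(amount for hour, amount in purchases if hour <= validity_hours)
completed = spend_in_window >= difficulty
print(spend_in_window, completed)  # 15.0 True: completed even if the offer was never viewed
```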
portfolio.json (10 offers x 6 fields) - metadata for each offer (duration, type, etc.)
profile.json (17,000 users x 5 fields) - demographic data for each user
transcript.json (306,534 transactions x 4 fields)- records for events (transactions, offers received, offers viewed, and offers completed)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.subplots as subplots
sns.set()
import gc
import json
import numpy as np
import pandas as pd
# Data
portfolio = pd.read_json('../data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('../data/profile.json', orient='records', lines=True)
transcript = pd.read_json('../data/transcript.json', orient='records', lines=True)
portfolio.shape, profile.shape, transcript.shape
((10, 6), (17000, 5), (306534, 4))
print(portfolio.info())
portfolio
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #  Column      Non-Null Count  Dtype
--  ------      --------------  -----
 0  reward      10 non-null     int64
 1  channels    10 non-null     object
 2  difficulty  10 non-null     int64
 3  duration    10 non-null     int64
 4  offer_type  10 non-null     object
 5  id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes
None
| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 5 | 3 | [web, email, mobile, social] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 |
| 6 | 2 | [web, email, mobile, social] | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 |
| 7 | 0 | [email, mobile, social] | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 |
| 8 | 5 | [web, email, mobile, social] | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d |
| 9 | 2 | [web, email, mobile] | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 |
print(profile.info())
profile.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #  Column            Non-Null Count  Dtype
--  ------            --------------  -----
 0  gender            14825 non-null  object
 1  age               17000 non-null  int64
 2  id                17000 non-null  object
 3  became_member_on  17000 non-null  int64
 4  income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
None
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | None | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | None | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | None | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN |
print(transcript.info())
transcript.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #  Column  Non-Null Count   Dtype
--  ------  --------------   -----
 0  person  306534 non-null  object
 1  event   306534 non-null  object
 2  value   306534 non-null  object
 3  time    306534 non-null  int64
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
None
| | person | event | value | time |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 |
def show_missing_values(df, encoded_col=None, encoded_val=None):
'''
Show the number and proportion of missing values in every column of a dataframe.
Args:
(1) df (dataframe) - data to inspect
(2) encoded_col (str) - name of column with missing values encoded *
(3) encoded_val (str or int or float) - value that was used to encode missing values *
* Pass in args (2) and (3) if a column had its missing values encoded
Returns:
Dataframe showing number and proportion of missing values in each column (dataframe).
'''
missing_val = df.isnull().sum().reset_index() # missing values
missing_val.columns = ['feat', 'n_missing']
if encoded_col: # if missing vals of a col were encoded
missing_val.loc[missing_val.feat == encoded_col, 'n_missing'] = (df[encoded_col] == encoded_val).sum()
missing_val['pct_missing'] = np.round(100 * missing_val.n_missing / df.shape[0], 2) # % missing
return missing_val
# Only user data had missing values
show_missing_values(profile, 'age', 118) # missing `age` vals were encoded as 118
| | feat | n_missing | pct_missing |
|---|---|---|---|
| 0 | gender | 2175 | 12.79 |
| 1 | age | 2175 | 12.79 |
| 2 | id | 0 | 0.00 |
| 3 | became_member_on | 0 | 0.00 |
| 4 | income | 2175 | 12.79 |
# Check if all the missing vals are from the same rows
profile[profile.age == 118].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 16994
Data columns (total 5 columns):
 #  Column            Non-Null Count  Dtype
--  ------            --------------  -----
 0  gender            0 non-null      object
 1  age               2175 non-null   int64
 2  id                2175 non-null   object
 3  became_member_on  2175 non-null   int64
 4  income            0 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 102.0+ KB
The same 2175 users did not fill in any information when they signed up, so their gender, age, and income are all missing. Drop these users for a cleaner analysis.
# Drop users with missing values
missing_cust = profile.loc[profile.gender.isnull(), 'id'].values
missing_cust_events = transcript[transcript.person.isin(missing_cust)].index.values
transcript.drop(missing_cust_events, inplace=True) # drop events
profile.dropna(inplace=True) # drop users
profile.shape, transcript.shape
((14825, 5), (272762, 4))
def get_unique_values(df_col, type_dict=False):
'''
Get the unique values of a dataframe column with lists or the unique keys of a column with dictionaries.
Args:
(1) df_col (Pandas series) - dataframe column with iterable values
(2) type_dict (bool) - True if value type is dictionary or False if value type is list
Returns:
Unique values from column iterables (list[str]).
'''
if type_dict: # for dictionary
df_col = df_col.apply(lambda d: list(d.keys())) # get lists of keys
return np.unique(np.concatenate(df_col.values)).tolist() # get unique values
def extract_from_iterable_col(df, old_col, drop_old_col=True, unique_values=None, unique_keys=None):
'''
Extract data into new columns from a column with iterable values like lists or dictionaries.
If extracting from dictionaries, the new column names will be the dictionary keys and the
values will be dictionary values. If extracting from lists, one-hot encode the list values.
Args:
(1) df (Pandas dataframe) - data containing a column with iterable values
(2) old_col (str) - name of column to extract data from
(3) drop_old_col (bool) - whether or not to drop the old column after extracting data from it
(4) unique_values (list[str]) - pass in unique values if data type of column is list *
(5) unique_keys (list[str]) - pass in unique keys if data type of column is dictionary *
* Pass in arg (4) for a list column or arg (5) for a dictionary column
Returns:
Same data with new columns extracted from the old column (Pandas dataframe).
'''
df = df.copy()
if unique_keys is not None: # for dicts
for k in unique_keys:
if ' ' not in k: # to skip duplicate key with space
df[k] = df[old_col].apply(lambda d: d[k] if k in d # get dict val
else (d[k.replace('_', ' ')] if k.replace('_', ' ') in d # for duplicate key with space
else np.NaN)) # val is nan if key is not in dict
elif unique_values is not None: # for lists
for v in unique_values:
new_col = df[old_col].apply(lambda lst: int(v in lst)) # 1 if val is in list else 0
if np.var(new_col): # if new col is not constant
df[v] = new_col # add new col
if drop_old_col:
df.drop(old_col, axis=1, inplace=True)
return df
# Get unique values
channels = get_unique_values(portfolio.channels) # unique channel values from offer
offer_keys = get_unique_values(transcript.value, type_dict=True) # unique offer keys from event
channels, offer_keys
(['email', 'mobile', 'social', 'web'], ['amount', 'offer id', 'offer_id', 'reward'])
# Extract channels from offer metadata
portfolio = extract_from_iterable_col(portfolio, 'channels', unique_values=channels)
portfolio.head()
| | reward | difficulty | duration | offer_type | id | mobile | social | web |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 1 | 0 |
| 1 | 10 | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 1 | 1 | 1 |
| 2 | 0 | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed | 1 | 0 | 1 |
| 3 | 5 | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 1 | 0 | 1 |
| 4 | 5 | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 0 | 0 | 1 |
# Extract dictionary values from event data
transcript = extract_from_iterable_col(transcript, 'value', unique_keys=offer_keys)
transcript.head()
| | person | event | time | amount | offer_id | reward |
|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | NaN | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | NaN |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | 0 | NaN | 2906b810c7d4411798c6938adc9daaa5 | NaN |
| 5 | 389bc3fa690240e798340f5a15918d5c | offer received | 0 | NaN | f19421c1d4aa40978ebb69ca19b0e20d | NaN |
| 7 | 2eeac8d8feae4a8cad5a6af0499a211d | offer received | 0 | NaN | 3f207df678b143eea3cee63160fa8bed | NaN |
| 8 | aa4862eba776480b8bb9c68455b8c2e1 | offer received | 0 | NaN | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | NaN |
# Sort offer types in order of difficulty - informational < discount < bogo
portfolio.sort_values(['offer_type', 'difficulty', 'duration'], ascending=[False, True, False], inplace=True)
portfolio.reset_index(drop=True, inplace=True)
portfolio.head()
| | reward | difficulty | duration | offer_type | id | mobile | social | web |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed | 1 | 0 | 1 |
| 1 | 0 | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 | 1 | 1 | 0 |
| 2 | 3 | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 | 1 | 1 | 1 |
| 3 | 2 | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 | 1 | 1 | 1 |
| 4 | 2 | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 0 | 1 |
def create_map_dicts(series):
'''
Create a mapping of unique hash strings of a dataframe column to integers.
Args:
series (Pandas series) - dataframe column with hash strings
Returns:
Mapping of unique hash strings to integers (dict)
'''
return {s: i for i, s in enumerate(series.unique(), start=1)}
def map_iss_to_num(series, map_dicts):
'''
Map ID hash strings of a dataframe column to integers.
Args:
(1) series (Pandas series) - dataframe column to map
(2) map_dicts (dicts) - mapping with hash strings as keys and integers as values
Returns:
Dataframe column with values mapped to integers (Pandas series).
'''
return series.map(map_dicts)
# Create mappings for offer IDs and user IDs
offer_mapping = create_map_dicts(portfolio.id)
user_mapping = create_map_dicts(profile.id)
len(offer_mapping), len(user_mapping)
(10, 14825)
# Convert offer IDs in offer to integers
portfolio['offer_id'] = map_iss_to_num(portfolio.id, offer_mapping)
portfolio.drop('id', axis=1, inplace=True) # drop column id
portfolio
| | reward | difficulty | duration | offer_type | mobile | social | web | offer_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 4 | informational | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 3 | informational | 1 | 1 | 0 | 2 |
| 2 | 3 | 7 | 7 | discount | 1 | 1 | 1 | 3 |
| 3 | 2 | 10 | 10 | discount | 1 | 1 | 1 | 4 |
| 4 | 2 | 10 | 7 | discount | 1 | 0 | 1 | 5 |
| 5 | 5 | 20 | 10 | discount | 0 | 0 | 1 | 6 |
| 6 | 5 | 5 | 7 | bogo | 1 | 0 | 1 | 7 |
| 7 | 5 | 5 | 5 | bogo | 1 | 1 | 1 | 8 |
| 8 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | 9 |
| 9 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | 10 |
# Convert user IDs in user column to integers
profile['user_id'] = map_iss_to_num(profile.id, user_mapping)
profile.drop('id', axis=1, inplace=True)
profile.head()
| | gender | age | became_member_on | income | user_id |
|---|---|---|---|---|---|
| 1 | F | 55 | 20170715 | 112000.0 | 1 |
| 3 | F | 75 | 20170509 | 100000.0 | 2 |
| 5 | M | 68 | 20180426 | 70000.0 | 3 |
| 8 | M | 65 | 20180209 | 53000.0 | 4 |
| 12 | M | 58 | 20171111 | 51000.0 | 5 |
# Convert offer and user IDs in event data to integers
transcript['offer_id'] = map_iss_to_num(transcript.offer_id, offer_mapping)
transcript['user_id'] = map_iss_to_num(transcript.person, user_mapping)
transcript.drop('person', axis=1, inplace=True)
transcript.head()
| | event | time | amount | offer_id | reward | user_id |
|---|---|---|---|---|---|---|
| 0 | offer received | 0 | NaN | 7.0 | NaN | 2 |
| 2 | offer received | 0 | NaN | 5.0 | NaN | 3 |
| 5 | offer received | 0 | NaN | 8.0 | NaN | 4 |
| 7 | offer received | 0 | NaN | 1.0 | NaN | 5 |
| 8 | offer received | 0 | NaN | 6.0 | NaN | 6 |
# Missing values in event data
show_missing_values(transcript)
| | feat | n_missing | pct_missing |
|---|---|---|---|
| 0 | event | 0 | 0.00 |
| 1 | time | 0 | 0.00 |
| 2 | amount | 148805 | 54.55 |
| 3 | offer_id | 123957 | 45.45 |
| 4 | reward | 240318 | 88.11 |
| 5 | user_id | 0 | 0.00 |
# Missing value count for each event type
print('Overall event count:')
print(transcript.event.value_counts())
for e in transcript.event.unique():
print(f'\nMissing values for event: {e}')
print(transcript[transcript.event == e].isnull().sum().iloc[2:5])
Overall event count:
transaction        123957
offer received      66501
offer viewed        49860
offer completed     32444
Name: event, dtype: int64

Missing values for event: offer received
amount      66501
offer_id        0
reward      66501
dtype: int64

Missing values for event: offer viewed
amount      49860
offer_id        0
reward      49860
dtype: int64

Missing values for event: transaction
amount           0
offer_id    123957
reward      123957
dtype: int64

Missing values for event: offer completed
amount      32444
offer_id        0
reward          0
dtype: int64
These missing values are structural rather than true gaps:
- "offer received" and "offer viewed" events involve no purchase and no reward given, so there shouldn't be an `amount` or a `reward`.
- "transaction" events are not tied to any offer, so there shouldn't be an `offer_id` or a `reward`.
- "offer completed" events involve no purchase, so there shouldn't be an `amount`.
Since these values aren't actually missing, fill all of them with 0.
# Fill missing values with 0
transcript.fillna(0, inplace=True)
transcript.isnull().sum().sum()
0
# Check data types in all datasets
print(portfolio.info())
print(profile.info())
print(transcript.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10 entries, 0 to 9 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 reward 10 non-null int64 1 difficulty 10 non-null int64 2 duration 10 non-null int64 3 offer_type 10 non-null object 4 mobile 10 non-null int64 5 social 10 non-null int64 6 web 10 non-null int64 7 offer_id 10 non-null int64 dtypes: int64(7), object(1) memory usage: 768.0+ bytes None <class 'pandas.core.frame.DataFrame'> Int64Index: 14825 entries, 1 to 16999 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 14825 non-null object 1 age 14825 non-null int64 2 became_member_on 14825 non-null int64 3 income 14825 non-null float64 4 user_id 14825 non-null int64 dtypes: float64(1), int64(3), object(1) memory usage: 694.9+ KB None <class 'pandas.core.frame.DataFrame'> Int64Index: 272762 entries, 0 to 306532 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 event 272762 non-null object 1 time 272762 non-null int64 2 amount 272762 non-null float64 3 offer_id 272762 non-null float64 4 reward 272762 non-null float64 5 user_id 272762 non-null int64 dtypes: float64(3), int64(2), object(1) memory usage: 14.6+ MB None
# convert income, offer_id and reward columns to integers
profile['income'] = profile.income.astype(int) # `income` in user data
transcript['offer_id'] = transcript.offer_id.astype(int) # `offer_id` in event data
transcript['reward'] = transcript.reward.astype(int) # `reward` in event data
transcript.head()
| | event | time | amount | offer_id | reward | user_id |
|---|---|---|---|---|---|---|
| 0 | offer received | 0 | 0.0 | 7 | 0 | 2 |
| 2 | offer received | 0 | 0.0 | 5 | 0 | 3 |
| 5 | offer received | 0 | 0.0 | 8 | 0 | 4 |
| 7 | offer received | 0 | 0.0 | 1 | 0 | 5 |
| 8 | offer received | 0 | 0.0 | 6 | 0 | 6 |
# Convert `became_member_on` in user data to datetime format and rename to `user_date`
profile['user_date'] = pd.to_datetime(profile.became_member_on.astype(str))
profile.drop('became_member_on', axis=1, inplace=True)
profile.head()
| | gender | age | income | user_id | user_date |
|---|---|---|---|---|---|
| 1 | F | 55 | 112000 | 1 | 2017-07-15 |
| 3 | F | 75 | 100000 | 2 | 2017-05-09 |
| 5 | M | 68 | 70000 | 3 | 2018-04-26 |
| 8 | M | 65 | 53000 | 4 | 2018-02-09 |
| 12 | M | 58 | 51000 | 5 | 2017-11-11 |
print('Duplicated users:', profile.duplicated().sum())
print('Duplicated events:', transcript.duplicated().sum())
Duplicated users: 0
Duplicated events: 374
# Duplicate events
dupl_events = transcript[transcript.duplicated(keep=False)]
dupl_events.head()
| | event | time | amount | offer_id | reward | user_id |
|---|---|---|---|---|---|---|
| 66122 | offer completed | 168 | 0.0 | 5 | 2 | 1271 |
| 66123 | offer completed | 168 | 0.0 | 5 | 2 | 1271 |
| 66782 | offer completed | 168 | 0.0 | 9 | 10 | 3872 |
| 66783 | offer completed | 168 | 0.0 | 9 | 10 | 3872 |
| 67613 | offer completed | 168 | 0.0 | 7 | 5 | 7462 |
# Duplicated event types
dupl_events.event.value_counts()
offer completed    747
Name: event, dtype: int64
All 374 of the duplicate events are "offer completed" records. It is unlikely that the same user completed the same offer twice within the same hour, so these duplicates can be dropped.
# Drop duplicate events
transcript.drop_duplicates(inplace=True)
transcript.duplicated().sum()
0
del missing_cust, missing_cust_events, channels, offer_keys, \
offer_mapping, user_mapping, dupl_events, e
gc.collect()
120
def filter_events(event_list, event_df=transcript, event_col='event',
cols_to_keep=['event', 'time', 'offer_id', 'user_id', 'amount'],
merge_with=None, merge_on=None):
'''
Filter event data for the given event types and optionally merge with offer metadata or user data.
Args:
(1) event_list (list[str]) - events to filter
(2) event_df (Pandas df) - event data
(3) event_col (str) - name of event type column
(4) cols_to_keep (list[str]) - columns to keep in the filtered dataframe
(5) merge_with (Pandas dataframe) - offer metadata or user data
(6) merge_on (str) - column to merge on
Returns:
Filtered event data containing only the specified events (Pandas dataframe)
'''
filter_event = event_df.loc[event_df[event_col].isin(event_list), cols_to_keep].reset_index()
if merge_with is not None:
filter_event = pd.merge(filter_event, merge_with, on=merge_on, how='left')
return filter_event
# Filter for "offer received" events and left-merge with offer metadata
received_offers = filter_events(['offer received'], cols_to_keep=['time', 'user_id', 'offer_id'],
merge_with=portfolio, merge_on='offer_id')
print(f'A total of {received_offers.shape[0]} offers were sent out to {received_offers.user_id.nunique()} users.\n')
received_offers.head()
A total of 66501 offers were sent out to 14820 users.
| | index | time | user_id | offer_id | reward | difficulty | duration | offer_type | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2 | 7 | 5 | 5 | 7 | bogo | 1 | 0 | 1 |
| 1 | 2 | 0 | 3 | 5 | 2 | 10 | 7 | discount | 1 | 0 | 1 |
| 2 | 5 | 0 | 4 | 8 | 5 | 5 | 5 | bogo | 1 | 1 | 1 |
| 3 | 7 | 0 | 5 | 1 | 0 | 0 | 4 | informational | 1 | 0 | 1 |
| 4 | 8 | 0 | 6 | 6 | 5 | 20 | 10 | discount | 0 | 0 | 1 |
# Convert time in hours to day
received_offers['day'] = received_offers.time // 24
received_offers.day.value_counts()
24    11165
17    11124
14    11080
7     11061
21    11048
0     11023
Name: day, dtype: int64
print('Offers received by type:', f'\n{received_offers.offer_type.value_counts()}')
print('\nOffers received by ID:', f'\n{received_offers.offer_id.value_counts()}')
Offers received by type:
discount         26664
bogo             26537
informational    13300
Name: offer_type, dtype: int64

Offers received by ID:
6     6726
7     6685
9     6683
1     6657
3     6655
4     6652
2     6643
5     6631
10    6593
8     6576
Name: offer_id, dtype: int64
The 10 different offers were sent out in almost equal numbers. There are 4 different discount offers, 4 different BOGO offers, and only 2 different informational offers, so users received roughly twice as many discount and BOGO offers as informational offers.
We will use the term "rewarding offers" to refer to discount and BOGO offers together. Informational offers will be considered separately in the next subsection because there is no "completion" event for informational offers.
We will create binary features to indicate whether the offer was viewed and completed. These features will help us group the offers that were sent out. There are 4 possibilities that could have happened whenever a user received an offer:
1. 0 for viewed and 0 for completed
2. 1 for viewed and 0 for completed
3. 0 for viewed and 1 for completed
4. 1 for viewed and 1 for completed
# Filter for reward-offer events
offer_events = filter_events(['offer received', 'offer viewed', 'offer completed'],
merge_with=portfolio, merge_on='offer_id')
offer_events = offer_events.query('offer_type != "informational"') # drop informational offers
offer_events.shape
offer_events.head()
| | index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | offer received | 0 | 7 | 2 | 0.0 | 5 | 5 | 7 | bogo | 1 | 0 | 1 |
| 1 | 2 | offer received | 0 | 5 | 3 | 0.0 | 2 | 10 | 7 | discount | 1 | 0 | 1 |
| 2 | 5 | offer received | 0 | 8 | 4 | 0.0 | 5 | 5 | 5 | bogo | 1 | 1 | 1 |
| 4 | 8 | offer received | 0 | 6 | 6 | 0.0 | 5 | 20 | 10 | discount | 0 | 0 | 1 |
| 5 | 9 | offer received | 0 | 6 | 8 | 0.0 | 5 | 20 | 10 | discount | 0 | 0 | 1 |
def concat_string_cols(df, cols, sep='_', pref='', suff=''):
'''
Concatenate the string representations of several dataframe columns into one column.
Args:
(1) df (Pandas dataframe) - data
(2) cols (list[str]) - names of columns to concatenate
(3) sep (str or list[str]) - separator between columns
(4) pref (str) - string to prepend to the final values
(5) suff (str) - string to append to the final values
Returns:
Column with concatenated strings (Pandas series).
'''
if isinstance(sep, str):
sep = [sep] * (len(cols) - 1) + [suff]
new_cols = pref
for i in range(len(cols)):
new_cols += df[cols[i]].astype(str) + sep[i]
return new_cols
# Add expiring time
offer_events['expire_time'] = offer_events.apply(lambda e: e.duration * 24 + e.time if e.event == 'offer received' else -1, axis=1)
# Concatenate offer ID and user ID to group offers
offer_events['off_id_use_id'] = concat_string_cols(offer_events, ['offer_id', 'user_id'])
offer_events = offer_events.sort_values(['off_id_use_id', 'index']).reset_index(drop=True)
offer_events.head()
| | index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | mobile | social | web | expire_time | off_id_use_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 91 | offer received | 0 | 10 | 100 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | 120 | 10_100 |
| 1 | 15575 | offer viewed | 6 | 10 | 100 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | -1 | 10_100 |
| 2 | 151484 | offer received | 408 | 10 | 1000 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | 528 | 10_1000 |
| 3 | 182714 | offer viewed | 444 | 10 | 1000 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | -1 | 10_1000 |
| 4 | 185051 | offer completed | 450 | 10 | 1000 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | -1 | 10_1000 |
def fill_val(view_value, comp_value, rows_to_fill, off_evt_df=offer_events, view_col='viewed', comp_col='completed'):
'''
Fill the binary `viewed` and `completed` columns for the given rows of the offer event data.
Args:
(1) view_value (int) - 1 if the offer was viewed, else 0
(2) comp_value (int) - 1 if the offer was completed, else 0
(3) rows_to_fill (list[int]) - indices of rows to fill
(4) off_evt_df (Pandas dataframe) - offer event data
(5) view_col (str) - name of binary column indicating whether the offer was viewed
(6) comp_col (str) - name of binary column indicating whether the offer was completed
Returns: None
'''
for row in rows_to_fill:
if view_value in [0, 1]:
off_evt_df.loc[row, view_col] = view_value
if comp_value in [0, 1]:
off_evt_df.loc[row, comp_col] = comp_value
def go_to_event(curr_ids, delta=1, event_df=offer_events):
'''
Move a given number of rows from the current event and return the event at the new index.
Args:
(1) curr_ids (int) - current index
(2) delta (int) - change in index
(3) event_df (Pandas dataframe) - event data
Returns:
(1) Index of new event (int).
(2) New event (Pandas series) if index within range
or index (int) if out of range
'''
next_ids = next_evt = curr_ids + delta
if 0 <= next_ids < event_df.shape[0]:
next_evt = event_df.iloc[next_ids]
return next_ids, next_evt
# Create binary features to indicate whether the offer was viewed and completed
offer_events['viewed'] = -1
offer_events['completed'] = -1
offer_events.tail()
| | index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | mobile | social | web | expire_time | off_id_use_id | viewed | completed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 125766 | 76457 | offer viewed | 186 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | -1 | 9_9999 | -1 | -1 |
| 125767 | 90031 | offer completed | 228 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | -1 | 9_9999 | -1 | -1 |
| 125768 | 119287 | offer received | 336 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | 504 | 9_9999 | -1 | -1 |
| 125769 | 131959 | offer completed | 348 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | -1 | 9_9999 | -1 | -1 |
| 125770 | 136706 | offer viewed | 360 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | -1 | 9_9999 | -1 | -1 |
# Fill `viewed` and `completed` values
for ids, iss, evt in offer_events[['off_id_use_id', 'event']].itertuples():
if ids + 2 < offer_events.shape[0] and evt == 'offer received': # current event: offer received
evt2, evt3 = offer_events.iloc[ids + 1], offer_events.iloc[ids + 2] # next 2 events
if evt2.off_id_use_id == iss and evt2.event == 'offer viewed':
if evt3.off_id_use_id == iss and evt3.event == 'offer completed': # next 2 events: offer viewed, offer completed
fill_val(1, 1, [ids, ids + 1, ids + 2])
else: # next 2 events: offer viewed, offer received or offer viewed
fill_val(1, 0, [ids, ids + 1])
elif evt2.off_id_use_id == iss and evt2.event == 'offer completed':
if evt3.off_id_use_id == iss and evt3.event == 'offer viewed': # next 2 events: offer completed, offer viewed
fill_val(0, 1, [ids, ids + 1, ids + 2])
else: # next 2 events: offer completed, offer received or offer completed
fill_val(0, 1, [ids, ids + 1])
else: # next event: "offer received"
fill_val(0, 0, [ids])
# If an event wasn't filled, it was because there were 2 completion events in a row without a "received" or "viewed" in between
elif offer_events.loc[ids, 'viewed'] < 0:
if evt == 'offer viewed': # current event: offer viewed
fill_val(0, 1, [ids])
if evt == 'offer completed': # current event: offer completed
# Go 3 rows back for event 0
ids0, evt0 = go_to_event(ids, -3)
fill_viewed = True
# Iterate event 0 backwards to find the offer this completion belongs to
while not isinstance(evt0, int) and evt0.off_id_use_id == iss: # continue until ID combo changes
if evt0.completed < 1:
if evt0.event == 'offer viewed' and fill_viewed: # event 0: offer viewed
fill_val(-1, 1, [ids0])
fill_viewed = False
if evt0.event == 'offer received': # event 0: offer received
fill_val(-1, 1, [ids0])
fill_val(evt0.viewed, 1, [ids])
break
ids0, evt0 = go_to_event(ids0, -1) # go back 1 row
print('Events not filled:', offer_events.query('viewed < 0 or completed < 0').shape[0])
offer_events.tail()
Events not filled: 0
| | index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | mobile | social | web | expire_time | off_id_use_id | viewed | completed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 125766 | 76457 | offer viewed | 186 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | -1 | 9_9999 | 1 | 1 |
| 125767 | 90031 | offer completed | 228 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | -1 | 9_9999 | 1 | 1 |
| 125768 | 119287 | offer received | 336 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | 504 | 9_9999 | 0 | 1 |
| 125769 | 131959 | offer completed | 348 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | -1 | 9_9999 | 0 | 1 |
| 125770 | 136706 | offer viewed | 360 | 9 | 9999 | 0.0 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | -1 | 9_9999 | 0 | 1 |
def group_offer(event_row, viewed_cols='viewed', used_cols='completed'):
'''
Helper function to assign a group label to an offer-related event.
Group 1: offers that were neither viewed nor completed
Group 2: offers that were viewed, but not completed
Group 3: offers that were completed, but not viewed
Group 4: offers that were both viewed and completed
Args:
(1) event_row (Pandas Series) - row in event data
(2) viewed_cols (str) - name of `viewed` binary column
(3) used_cols (str) - name of `completed` binary column
Returns:
Group number the offer belongs to (int).
'''
if event_row[viewed_cols] == 0 and event_row[used_cols] == 0:
return 1
if event_row[viewed_cols] == 1 and event_row[used_cols] == 0:
return 2
if event_row[viewed_cols] == 0 and event_row[used_cols] == 1:
return 3
if event_row[viewed_cols] == 1 and event_row[used_cols] == 1:
return 4
# Add a `group` feature to group offers as described above
offer_events['group'] = offer_events.apply(group_offer, axis=1)
offer_events.head()
| | index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | mobile | social | web | expire_time | off_id_use_id | viewed | completed | group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 91 | offer received | 0 | 10 | 100 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | 120 | 10_100 | 1 | 0 | 2 |
| 1 | 15575 | offer viewed | 6 | 10 | 100 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | -1 | 10_100 | 1 | 0 | 2 |
| 2 | 151484 | offer received | 408 | 10 | 1000 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | 528 | 10_1000 | 1 | 1 | 4 |
| 3 | 182714 | offer viewed | 444 | 10 | 1000 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | -1 | 10_1000 | 1 | 1 | 4 |
| 4 | 185051 | offer completed | 450 | 10 | 1000 | 0.0 | 10 | 10 | 5 | bogo | 1 | 1 | 1 | -1 | 10_1000 | 1 | 1 | 4 |
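Since `group` is fully determined by the two binary flags, the same labels can also be assigned without a row-wise `apply` - a vectorized sketch on a toy frame (not a method used in this notebook, just an equivalent one-liner):

```python
import pandas as pd

# The mapping (viewed, completed) -> group is
# (0,0)->1, (1,0)->2, (0,1)->3, (1,1)->4,
# which is exactly 1 + viewed + 2 * completed.
df = pd.DataFrame({'viewed': [0, 1, 0, 1], 'completed': [0, 0, 1, 1]})
df['group'] = 1 + df['viewed'] + 2 * df['completed']
print(df['group'].tolist())  # -> [1, 2, 3, 4]
```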
def show_group_count(event_df, plot_counts=True):
'''
Print a summary of offer groups and optionally plot a stacked horizontal bar chart of offer group counts.
Args:
(1) event_df (Pandas dataframe) - event data
(2) plot_counts (bool) - whether to plot the offer group counts
Returns: None.
'''
# Total offers received and completed
received_offers = event_df.query('event == "offer received"')
completed_offers = received_offers.query('completed == 1')
print('Total offers sent out:', received_offers.shape[0])
print('Offers used:', completed_offers.shape[0])
print('Offers not used:',received_offers.shape[0] - completed_offers.shape[0])
# Number of offers in each group
group_count = received_offers.group.value_counts()
print('\n(Group 1) Offers neither viewed nor used:', group_count[1])
print('(Group 2) Offers viewed, but not used:', group_count[2])
print('(Group 3) Offers completed, but not viewed:', group_count[3])
print('(Group 4) Offers both viewed and used:', group_count[4])
# Plot group counts
if plot_counts:
data = [go.Bar(x=[group_count[1], group_count[3]],
y=['Offers not used', 'Offers used'],
name='Not viewed', orientation='h'),
go.Bar(x=[group_count[2], group_count[4]],
y=['Offers not used', 'Offers used'],
name='Viewed', orientation='h')]
go.Figure(data=data, layout=go.Layout(barmode='stack',
title='Groups & Offers',
xaxis=dict(title='Number of offers'))).show()
# Show number of offers in each group
print('Reward Offers')
print('-------------')
show_group_count(offer_events)
Reward Offers
-------------
Total offers sent out: 53201
Offers used: 32070
Offers not used: 21131

(Group 1) Offers neither viewed nor used: 7165
(Group 2) Offers viewed, but not used: 13966
(Group 3) Offers completed, but not viewed: 9688
(Group 4) Offers both viewed and used: 22382
Out of the total 21,131 unused offers, 7,165 were never even viewed. Out of the 32,070 used offers, only 22,382 were genuinely used, i.e. viewed before completion. The remaining 9,688 "used" offers cannot truly be considered used, since the user either completed them without ever viewing the offer or viewed the offer only after completing it. These 9,688 offers cannot be considered incomplete either, so they will be kept separate.
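The arithmetic behind these figures is just the 2x2 viewed/completed table printed above; a quick sketch that re-derives the totals from the four group counts:

```python
import pandas as pd

# Group counts reported by show_group_count, arranged viewed x completed
counts = pd.DataFrame({'not_completed': [7165, 13966],    # groups 1 and 2
                       'completed': [9688, 22382]},        # groups 3 and 4
                      index=['not_viewed', 'viewed'])
total_sent = int(counts.to_numpy().sum())            # 53201 offers sent
used = int(counts['completed'].sum())                # 32070 completion records
truly_used = int(counts.loc['viewed', 'completed'])  # 22382 viewed first
print(total_sent, used, total_sent - used, truly_used)
```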
Since informational offers carry no reward, there is no "completion" event, so they are measured a little differently. There are 2 informational offers with different durations, which are assumed to be the number of days the advertisement influences the customer after viewing it. We will check which of these offers were followed by a transaction (of a minimum amount) within the duration of the offer.
# Analyse informational offers and transactions
inform_events = filter_events(['offer received', 'offer viewed', 'transaction'],
merge_with=portfolio, merge_on='offer_id').fillna(-1)
inform_events = inform_events[inform_events.offer_id.isin([0, 1, 2])] # drop reward offers
# Recast metadata columns as int
for col in ['reward', 'difficulty', 'duration', 'mobile', 'social', 'web']:
inform_events[col] = inform_events[col].astype(int)
inform_events.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 146617 entries, 3 to 240317
Data columns (total 13 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   index       146617 non-null  int64
 1   event       146617 non-null  object
 2   time        146617 non-null  int64
 3   offer_id    146617 non-null  int32
 4   user_id     146617 non-null  int64
 5   amount      146617 non-null  float64
 6   reward      146617 non-null  int32
 7   difficulty  146617 non-null  int32
 8   duration    146617 non-null  int32
 9   offer_type  146617 non-null  object
 10  mobile      146617 non-null  int32
 11  social      146617 non-null  int32
 12  web         146617 non-null  int32
dtypes: float64(1), int32(7), int64(3), object(2)
memory usage: 11.7+ MB
# Drop transactions of users that did not receive an informational offer
users_with_info_offers = inform_events.query('event != "transaction"').user_id.unique()
inform_events = inform_events[inform_events.user_id.isin(users_with_info_offers)]
inform_events.shape
(98029, 13)
# Add expiration time
inform_events['expire_time'] = inform_events.apply(lambda e: -1 if e.event == 'transaction' else e.duration * 24 + e.time, axis=1)
# Concatenate offer ID and user ID to group offers
inform_events['off_id_use_id'] = concat_string_cols(inform_events, ['offer_id', 'user_id'])
inform_events = inform_events.sort_values(['user_id', 'index']).reset_index(drop=True)
inform_events.head()
| | index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | mobile | social | web | expire_time | off_id_use_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20282 | transaction | 18 | 0 | 1 | 21.51 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 0_1 |
| 1 | 49501 | transaction | 144 | 0 | 1 | 32.28 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 0_1 |
| 2 | 201571 | offer received | 504 | 1 | 1 | 0.00 | 0 | 0 | 4 | informational | 1 | 0 | 1 | 600 | 1_1 |
| 3 | 227842 | transaction | 528 | 0 | 1 | 23.22 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 0_1 |
| 4 | 47582 | transaction | 132 | 0 | 2 | 19.89 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 0_2 |
# Create binary features that indicate whether the offer was viewed and used
inform_events['viewed'] = -1
inform_events['completed'] = -1
# Fill `viewed` and `completed` values
for ids, use_id, off_id, evt in inform_events[['user_id', 'offer_id', 'event']].itertuples():
if evt == 'offer received': # current event: offer received
view_value = comp_value = 0 # init current event as not viewed and not completed
rows_to_fill = [ids]
ids2, evt2 = go_to_event(ids, event_df=inform_events) # next event
# Check if the offer was viewed
while not isinstance(evt2, int) and use_id == evt2.user_id:
if evt2.event == 'offer viewed' and evt2.viewed < 0: # next event: offer viewed
view_value = 1 # update to viewed
rows_to_fill.append(ids2) # to apply change to both events
break
ids2, evt2 = go_to_event(ids2, event_df=inform_events) # next event
# Get offer start and end time
if len(rows_to_fill) == 1: # offer was not viewed
ids2, evt2 = go_to_event(ids, 0, event_df=inform_events) # go back to current event
start, end = evt2[['time', 'expire_time']].values
ids2, evt2 = go_to_event(ids2, event_df=inform_events)
# Check for a transaction within the duration
while not isinstance(evt2, int) and use_id == evt2.user_id:
if evt2.event == 'transaction' and evt2.time < end and evt2.amount >= 2.5: # next event: transaction
comp_value = 1 # update to completed
break
ids2, evt2 = go_to_event(ids2, event_df=inform_events) # next event
# Fill with new value
fill_val(view_value, comp_value, rows_to_fill, inform_events)
print('Offers not filled:', inform_events.query('event != "transaction" and (viewed < 0 or completed < 0)').shape[0])
inform_events.loc[inform_events.event != 'transaction', ['viewed', 'completed']].value_counts()
Offers not filled: 0
viewed completed
1 1 10920
0 7800
0 0 2131
1 1809
dtype: int64
# Add a `group` feature to group offers like with the reward offers
inform_events['group'] = inform_events.apply(group_offer, axis=1).fillna(-1).astype(int)
# Show number of offers in each group
print('Informational Offers')
print('--------------------')
show_group_count(inform_events)
Informational Offers
--------------------
Total offers sent out: 13300
Offers used: 7269
Offers not used: 6031

(Group 1) Offers neither viewed nor used: 2131
(Group 2) Offers viewed, but not used: 3900
(Group 3) Offers completed, but not viewed: 1809
(Group 4) Offers both viewed and used: 5460
For informational offers, completion means that the user made a transaction of at least 2.50 dollars (the threshold used in the code above) within the duration of the offer. The minimum amount reflects the assumption that the informational offers advertise products costing at least that much.
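The rule can be restated as a small standalone helper (a hypothetical rewrite of the loop above, using the same `amount >= 2.5` threshold; `transactions` is assumed to hold `time` in hours and `amount` in dollars):

```python
import pandas as pd

def informational_completed(received_time, duration_days, transactions,
                            min_amount=2.5):
    """True if a transaction of at least `min_amount` dollars falls
    within `duration_days` * 24 hours of receiving the offer."""
    end = received_time + duration_days * 24
    window = transactions[(transactions.time >= received_time) &
                          (transactions.time < end)]
    return bool((window.amount >= min_amount).any())

tx = pd.DataFrame({'time': [18, 144, 528], 'amount': [21.51, 32.28, 1.99]})
print(informational_completed(0, 4, tx))    # True: $21.51 at hour 18 < 96
print(informational_completed(200, 4, tx))  # False: nothing in [200, 296)
```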
# Combine reward and informational offer events
offers = pd.concat([offer_events, inform_events.query('event != "transaction"')])
offers = offers.sort_values('index').reset_index(drop=True)
offers.head()
| index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | mobile | social | web | expire_time | off_id_use_id | viewed | completed | group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | offer received | 0 | 7 | 2 | 0.0 | 5 | 5 | 7 | bogo | 1 | 0 | 1 | 168 | 7_2 | 1 | 1 | 4 |
| 1 | 2 | offer received | 0 | 5 | 3 | 0.0 | 2 | 10 | 7 | discount | 1 | 0 | 1 | 168 | 5_3 | 1 | 0 | 2 |
| 2 | 5 | offer received | 0 | 8 | 4 | 0.0 | 5 | 5 | 5 | bogo | 1 | 1 | 1 | 120 | 8_4 | 1 | 1 | 4 |
| 3 | 7 | offer received | 0 | 1 | 5 | 0.0 | 0 | 0 | 4 | informational | 1 | 0 | 1 | 96 | 1_5 | 0 | 0 | 1 |
| 4 | 8 | offer received | 0 | 6 | 6 | 0.0 | 5 | 20 | 10 | discount | 0 | 0 | 1 | 240 | 6_6 | 1 | 0 | 2 |
# Count true offers by ID - unviewed completed offers cannot be counted
true_offers = offers.query('event == "offer received" and group != 3') # remove offers that were completed, but not viewed
true_offer = true_offers.offer_id.value_counts().reset_index()
true_offer.columns = ['offer_id', 'n_offers']
# Count completed offers by ID
comp_offers = offers.query('event == "offer received" and group == 4') # offers that were viewed, then completed
comp_offer = comp_offers.offer_id.value_counts().reset_index()
comp_offer.columns = ['offer_id', 'n_completed']
true_offers.shape, comp_offers.shape
((55004, 18), (27842, 18))
# Calculate completion rate by offer ID
comp_offer = pd.merge(comp_offer, true_offer, on='offer_id', how='left')
comp_offer['pct_completed'] = (100 * comp_offer.n_completed / comp_offer.n_offers).round(2)
comp_offer.sort_values('pct_completed', ascending=False, inplace=True)
comp_offer
| | offer_id | n_completed | n_offers | pct_completed |
|---|---|---|---|---|
| 0 | 4 | 4316 | 6011 | 71.80 |
| 1 | 3 | 4101 | 5909 | 69.40 |
| 2 | 8 | 3354 | 5856 | 57.27 |
| 3 | 2 | 3077 | 6130 | 50.20 |
| 5 | 9 | 2554 | 5633 | 45.34 |
| 4 | 10 | 2719 | 6031 | 45.08 |
| 6 | 1 | 2383 | 5361 | 44.45 |
| 8 | 7 | 2020 | 4564 | 44.26 |
| 7 | 5 | 2024 | 4795 | 42.21 |
| 9 | 6 | 1294 | 4714 | 27.45 |
# Merge with offer data
comp_offer = pd.merge(comp_offer, portfolio, on='offer_id', how='left')
# Concatenate offer information for plots
comp_offer['offer'] = concat_string_cols(comp_offer,
['offer_id', 'offer_type', 'difficulty', 'reward', 'duration'],
['. ', ' - $', ' => $', ' (exp in ', ' days)'])
comp_offer['summary'] = concat_string_cols(comp_offer,
['n_offers', 'n_completed', 'pct_completed'],
[' | Completed: ', ' (', '%)'], pref='Received: ')
comp_offer.head()
| | offer_id | n_completed | n_offers | pct_completed | reward | difficulty | duration | offer_type | mobile | social | web | offer | summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 4316 | 6011 | 71.80 | 2 | 10 | 10 | discount | 1 | 1 | 1 | 4. discount - $10 => $2 (exp in 10 days) | Received: 6011 \| Completed: 4316 (71.8%) |
| 1 | 3 | 4101 | 5909 | 69.40 | 3 | 7 | 7 | discount | 1 | 1 | 1 | 3. discount - $7 => $3 (exp in 7 days) | Received: 5909 \| Completed: 4101 (69.4%) |
| 2 | 8 | 3354 | 5856 | 57.27 | 5 | 5 | 5 | bogo | 1 | 1 | 1 | 8. bogo - $5 => $5 (exp in 5 days) | Received: 5856 \| Completed: 3354 (57.27%) |
| 3 | 2 | 3077 | 6130 | 50.20 | 0 | 0 | 3 | informational | 1 | 1 | 0 | 2. informational - $0 => $0 (exp in 3 days) | Received: 6130 \| Completed: 3077 (50.2%) |
| 4 | 9 | 2554 | 5633 | 45.34 | 10 | 10 | 7 | bogo | 1 | 1 | 0 | 9. bogo - $10 => $10 (exp in 7 days) | Received: 5633 \| Completed: 2554 (45.34%) |
# Plot completed offer rates
plt.figure()
sns.barplot(data=comp_offer, x='pct_completed', y='offer', orient='h')
plt.legend(handles=[plt.Rectangle((0,0), 1, 1, color=color) for color in sns.color_palette()[:10]],
labels=comp_offer.summary.tolist(), fontsize='large', bbox_to_anchor=(1, 1.025))
plt.title('Completion Rate of All Offers', fontsize='xx-large')
plt.ylabel('Offer ID and type (spend a => get b, duration)', fontsize='large')
plt.xlabel('Percent of offers completed', fontsize='large');
del true_offers, true_offer, comp_offers, comp_offer, \
received_offers, users_with_info_offers, start, end, \
ids0, ids, ids2, iss, use_id, off_id, evt0, evt, evt2, evt3, \
col, fill_viewed, view_value, comp_value, rows_to_fill
gc.collect()
5235
The bar chart above shows the 10 unique offers in the dataset - 4 BOGO, 4 discount, and 2 informational. It excludes completed offers that were not viewed before completion. Informational "completions" shown here are offers the user viewed and then followed with a transaction of at least 2.50 dollars within the offer's duration. The highlights:
The reward amount is not as important a factor as the difficulty or duration. Our first instinct might be to assume that BOGO offers have the highest completion rate, since the user gets a 100% return on the money spent, but this is not the case. The likeliest reason is that BOGO offers are inherently harder to complete regardless of duration, because the difficulty amount has to be spent in a single transaction.
Discount offers, on the other hand, let users accumulate the difficulty amount across multiple transactions.
For example, offer 4 (spend 10 dollars to get a 2-dollar discount) would be completed by four small transactions of 3 dollars each, since they accumulate to more than 10 dollars.
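That distinction can be sketched with two toy predicates (hypothetical helpers; the single-transaction rule for BOGO is the assumption stated above, not something confirmed by the data dictionary):

```python
def discount_completed(purchases, difficulty):
    # Discount: spending accumulates across transactions
    return sum(purchases) >= difficulty

def bogo_completed(purchases, difficulty):
    # BOGO (as assumed above): a single transaction must meet the difficulty
    return any(p >= difficulty for p in purchases)

purchases = [3, 3, 3, 3]                  # four small purchases, $12 total
print(discount_completed(purchases, 10))  # True: 12 >= 10
print(bogo_completed(purchases, 10))      # False: no single purchase >= 10
```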
# Add signup year feature
profile['signup_year'] = profile.user_date.dt.year
profile.describe()
| | age | income | user_id | signup_year |
|---|---|---|---|---|
| count | 14825.000000 | 14825.000000 | 14825.000000 | 14825.000000 |
| mean | 54.393524 | 65404.991568 | 7413.000000 | 2016.620169 |
| std | 17.383705 | 21598.299410 | 4279.753206 | 1.198245 |
| min | 18.000000 | 30000.000000 | 1.000000 | 2013.000000 |
| 25% | 42.000000 | 49000.000000 | 3707.000000 | 2016.000000 |
| 50% | 55.000000 | 64000.000000 | 7413.000000 | 2017.000000 |
| 75% | 66.000000 | 80000.000000 | 11119.000000 | 2017.000000 |
| max | 101.000000 | 120000.000000 | 14825.000000 | 2018.000000 |
def plot_user_demographic(user_df, title='Demographics Overview',
height=600, width=1000, vspacing=0.15,
gender_col='gender', age_col='age',
signup_col='signup_year', income_col='income'):
'''
Plot the distribution of user demographics - `gender`, `age`, `signup year`, and `income`.
'''
fig = subplots.make_subplots(2, 2, vertical_spacing=vspacing, y_title='Number of users',
subplot_titles=['Gender', 'Age', 'Signup Year', 'Income'])
fig.add_trace(go.Histogram(x=user_df[gender_col].sort_values(), name='Gender'), 1, 1) # gender
fig.add_trace(go.Histogram(x=user_df[age_col], name='Age'), 1, 2) # age
fig.add_trace(go.Histogram(x=user_df[signup_col], name='Signup Year'), 2, 1) # signup year
fig.add_trace(go.Histogram(x=user_df[income_col], name='Income'), 2, 2) # income
fig.update_layout(title=title, height=height, width=width, showlegend=False)
fig.show()
# Plot user demographics
plot_user_demographic(profile)
User demographics summary:
None of the features is normally distributed.
Gender - there are more males than females, and a small portion are "other"
Age - the largest portion of users is between the ages of 50 and 65
Signup year - signups increase year over year from 2013 to 2017, then drop in 2018
Income - the largest portion of users earn between 50k and 75k
def merge_user_data_to_group_data(group_df, user_df=profile, merge_on='user_id',
cols_to_drop=['event', 'group', 'viewed', 'completed']):
'''
Left-merge user data into offer group data.
Args:
(1) group_df (Pandas dataframe) - offer group data
(2) user_df (Pandas dataframe) - user data
(3) merge_on (str) - name of column to merge on
(4) cols_to_drop (list[str]) - columns to drop from merged dataframe
Returns:
Group data left-merged with user data (Pandas dataframe).
'''
df = pd.merge(group_df, user_df, on=merge_on, how='left')
if isinstance(cols_to_drop, list) and len(cols_to_drop):
df.drop(cols_to_drop, axis=1, inplace=True)
return df
# Separate offers by group and merge user data into each group data
group1 = merge_user_data_to_group_data(offers.query('event == "offer received" and group == 1'))
group2 = merge_user_data_to_group_data(offers.query('event == "offer received" and group == 2'))
group3 = merge_user_data_to_group_data(offers.query('event == "offer received" and group == 3'))
group4 = merge_user_data_to_group_data(offers.query('event == "offer received" and group == 4'))
group1.shape, group2.shape, group3.shape, group4.shape
((9296, 19), (17866, 19), (11497, 19), (27842, 19))
# Plot user demographics by group
for n in range(1, 5):
print('Group', n)
print(eval(f'group{n}')[['age', 'income']].mean())
plot_user_demographic(eval(f'group{n}'), title=f'Group {n} User Demographics', height=500)
Group 1
age        51.895439
income  59394.255594
dtype: float64

Group 2
age        52.989197
income  61245.662152
dtype: float64

Group 3
age        55.816387
income  70079.238062
dtype: float64

Group 4
age        55.483227
income  68071.007830
dtype: float64
Group summary - the completers (groups 3 and 4) are on average older (about 55-56 vs 52-53) and earn higher incomes (about 68k-70k vs 59k-61k) than the non-completers (groups 1 and 2).
# Group age and income into bins
profile['age_group'] = pd.qcut(profile.age, 5, labels=range(1, 6)).astype(int)
profile['income_group'] = pd.qcut(profile.income, 5, labels=range(1, 6)).astype(int)
profile.shape
(14825, 8)
# Calculate the total amount each user spent
total_spent = transcript.groupby('user_id').amount.sum().reset_index()
profile = pd.merge(profile, total_spent, on='user_id', how='left')
# Calculate the decile (10-quantile) each user's spending falls into
profile['spend_group'] = pd.qcut(profile.amount, 10, range(1, 11)).astype(int)
profile.sort_values('amount', ascending=False).head()
| | gender | age | income | user_id | user_date | signup_year | age_group | income_group | amount | spend_group |
|---|---|---|---|---|---|---|---|---|---|---|
| 12951 | M | 77 | 97000 | 12952 | 2017-09-26 | 2017 | 5 | 5 | 1608.69 | 10 |
| 11767 | M | 32 | 64000 | 11768 | 2018-06-19 | 2018 | 1 | 3 | 1365.66 | 10 |
| 9404 | F | 36 | 71000 | 9405 | 2017-05-21 | 2017 | 1 | 4 | 1327.74 | 10 |
| 9293 | M | 53 | 103000 | 9294 | 2015-08-25 | 2015 | 3 | 5 | 1321.42 | 10 |
| 8987 | M | 50 | 86000 | 8988 | 2016-05-12 | 2016 | 2 | 5 | 1319.97 | 10 |
# Income by age and gender
sns.boxplot(data=profile.query('gender != "O"'), x='age_group', y='income', hue='gender')
plt.title('Income by Age and Gender', fontsize='x-large')
plt.legend(title='Gender', bbox_to_anchor=(1, 1));
# Box plots with hue
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(16, 4))
fig.suptitle('Total Amount Spent by User Demographics', fontsize='x-large')
sns.boxplot(data=profile.query('gender != "O"'), x='age_group', y='amount', hue='gender', showfliers=False, ax=ax[0])
sns.boxplot(data=profile.query('gender != "O"'), x='income_group', y='amount', hue='gender', showfliers=False, ax=ax[1])
ax[0].legend(title='Gender', bbox_to_anchor=(1.025, 1))
ax[1].legend('');
# Aggregate functions for each feature
agg_funcs = {
'amount': 'mean', # avg amount spent
'user_id': 'count', # num users
'signup_year': 'median', # med signup year
'gender': lambda g: (g == 'M').mean() * 100, # % male
'age': 'mean', # avg age
'income': 'mean', # avg income
}
# Calculate user demographics by spending 10-quantile
spending_q10 = profile.groupby('spend_group').agg(agg_funcs).astype(int)
spending_q10.columns = ['avg_spent', 'n_users', 'median_signup', 'pct_male', 'avg_age', 'avg_income']
spending_q10['pct_female'] = 100 - spending_q10.pct_male
spending_q10
| spend_group | avg_spent | n_users | median_signup | pct_male | avg_age | avg_income | pct_female |
|---|---|---|---|---|---|---|---|
| 1 | 6 | 1483 | 2018 | 74 | 51 | 53942 | 26 |
| 2 | 18 | 1483 | 2018 | 75 | 49 | 53429 | 25 |
| 3 | 31 | 1482 | 2017 | 71 | 50 | 55842 | 29 |
| 4 | 50 | 1482 | 2017 | 62 | 52 | 60085 | 38 |
| 5 | 74 | 1484 | 2017 | 55 | 55 | 65347 | 45 |
| 6 | 100 | 1481 | 2017 | 49 | 56 | 68137 | 51 |
| 7 | 129 | 1482 | 2017 | 48 | 56 | 69510 | 52 |
| 8 | 161 | 1483 | 2016 | 47 | 56 | 71810 | 53 |
| 9 | 205 | 1482 | 2016 | 44 | 56 | 76542 | 56 |
| 10 | 391 | 1483 | 2016 | 44 | 57 | 79405 | 56 |
# Plot changes in user demographics as spending increases
fig = subplots.make_subplots(3, 2, vertical_spacing=0.1, x_title='Spending quantile',
subplot_titles=['Average Amount Spent', 'Median Signup Year',
'Average Income', 'Average Age',
'Percent Female', 'Percent Male'])
fig.add_trace(go.Scatter(x=spending_q10.index, y=spending_q10.avg_spent, name='Average amount spent'), 1, 1)
fig.add_trace(go.Scatter(x=spending_q10.index, y=spending_q10.median_signup, name='Median signup year'), 1, 2)
fig.add_trace(go.Scatter(x=spending_q10.index, y=spending_q10.avg_income, name='Average income'), 2, 1)
fig.add_trace(go.Scatter(x=spending_q10.index, y=spending_q10.avg_age, name='Average age'), 2, 2)
fig.add_trace(go.Scatter(x=spending_q10.index, y=spending_q10.pct_female, name='Percent female'), 3, 1)
fig.add_trace(go.Scatter(x=spending_q10.index, y=spending_q10.pct_male, name='Percent male'), 3, 2)
# Layout
fig.update_layout(title='User Demographics as Spending Increases',
height=800, width=1000, showlegend=False)
fig.update_yaxes(nticks=3, row=1, col=2)
fig.show()
plt.figure(figsize=(10, 6))
sns.heatmap(spending_q10.corr(), annot=True)
plt.xticks(rotation=45, ha='right');
Users are grouped into 10 cohorts (deciles) based on the total amount they spent during the test period. As the amount spent increases, the trend in every feature is mostly monotonic: the signup year and the percentage of male users decrease, while age and income (for the most part) increase. From these plots, we can make the following observations and inferences about how different types of users spend money on the app:
Signup year - users who have been on the app longer tend to spend more
Income - users who earn more tend to spend more
Age - spending increases with age up to a point, then stays consistent across the older half
Gender - female users tend to spend more money
# Merge user data into event data
offers = pd.merge(offers, profile.drop('amount', axis=1), on='user_id', how='left')
# Extract true offers by ID - unviewed completed offers are not counted
true_offers = offers.query('event == "offer received" and group != 3')
true_offers.head()
| | index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | ... | completed | group | gender | age | income | user_date | signup_year | age_group | income_group | spend_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | offer received | 0 | 7 | 2 | 0.0 | 5 | 5 | 7 | bogo | ... | 1 | 4 | F | 75 | 100000 | 2017-05-09 | 2017 | 5 | 5 | 8 |
| 1 | 2 | offer received | 0 | 5 | 3 | 0.0 | 2 | 10 | 7 | discount | ... | 0 | 2 | M | 68 | 70000 | 2018-04-26 | 2018 | 4 | 3 | 4 |
| 2 | 5 | offer received | 0 | 8 | 4 | 0.0 | 5 | 5 | 5 | bogo | ... | 1 | 4 | M | 65 | 53000 | 2018-02-09 | 2018 | 4 | 2 | 3 |
| 3 | 7 | offer received | 0 | 1 | 5 | 0.0 | 0 | 0 | 4 | informational | ... | 0 | 1 | M | 58 | 51000 | 2017-11-11 | 2017 | 3 | 2 | 2 |
| 4 | 8 | offer received | 0 | 6 | 6 | 0.0 | 5 | 20 | 10 | discount | ... | 0 | 2 | F | 61 | 57000 | 2017-09-11 | 2017 | 4 | 2 | 5 |
5 rows × 26 columns
def plot_demographic_completion(data, hue, x='offer_id', y='completed'):
'''
Plot offer completion rate by a specified user demographic feature.
Args:
(1) data (Pandas dataframe) - data to plot
(2) hue (str) - name of demographic feature column
(3) x (str) - name of offer ID column
(4) y (str) - name of completed indicator column
Returns: None
'''
title = hue.replace("_", " ").title()
plt.figure(figsize=(16, 4))
sns.barplot(data=data, x=x, y=y, hue=hue)
plt.title(f'Completion Rate by {title}', fontsize='xx-large')
plt.ylabel('Proportion of offers completed')
plt.xlabel('Offer ID')
plt.legend(title=title, bbox_to_anchor=(1, 1))
plt.show()
# Plot offer completion rate by user demographics
for feat in ['gender', 'age_group', 'income_group', 'signup_year', 'spend_group']:
print('Counts for:', true_offers.groupby(feat).offer_id.count())
plot_demographic_completion(true_offers, feat)
Counts for: gender
F    21763
M    32448
O      793
Name: offer_id, dtype: int64

Counts for: age_group
1    11776
2    11253
3    10565
4    10958
5    10452
Name: offer_id, dtype: int64

Counts for: income_group
1    11840
2    11146
3    11719
4    10497
5     9802
Name: offer_id, dtype: int64

Counts for: signup_year
2013     1050
2014     2577
2015     5640
2016    10402
2017    20760
2018    14575
Name: offer_id, dtype: int64

Counts for: spend_group
1     6392
2     6181
3     5906
4     5722
5     5378
6     5095
7     4924
8     5115
9     5015
10    5276
Name: offer_id, dtype: int64
del group1, group2, group3, group4, total_spent, agg_funcs, spending_q10, true_offers, fig, n, feat
gc.collect()
55428
Since age and income are continuous variables, they are grouped into 5 quantiles to visualize how different groups responded to offers, from the youngest users in age group 1 to the oldest in age group 5, and from the lowest earners in income group 1 to the highest in income group 5.
The offers are ordered by how difficult they are to complete: 2 informational, 4 discount, and 4 BOGO offers. Informational offers have no spending threshold, and discount offers are easier to complete than BOGO offers because the amount can be accumulated over more than one transaction. Within each type, offers are sorted by increasing difficulty and decreasing duration. From the bar plots above, the following observations can be made:
Gender - female users have a higher completion rate than male users for all 10 offers
Age - the youngest age group has a lower completion rate than the other 4 groups for 9 out of 10 offers
Income - the relationship between income and completion rate is monotonic for the most part
Signup year - users who signed up in 2016 have the highest completion rate and users who signed up in 2018 the lowest for 9 out of 10 offers
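The 5-quantile binning described above is what `pd.qcut` does: it splits a continuous column into equal-sized bins. A toy illustration with ten evenly spread ages (not the notebook's data):

```python
import pandas as pd

# Ten users -> five quantile bins of two users each
ages = pd.Series([18, 25, 33, 41, 50, 58, 66, 74, 82, 90])
age_group = pd.qcut(ages, 5, labels=range(1, 6)).astype(int)
print(age_group.tolist())  # -> [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```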
portfolio.to_pickle('../data/output1/portfolio.pkl') # offer metadata
profile.to_pickle('../data/output1/profile.pkl') # user data
transcript.reset_index().to_pickle('../data/output1/transcript.pkl') # event data
offers.to_pickle('../data/output1/offers.pkl') # transcript of offer-related events
portfolio.shape, profile.shape, transcript.shape, offers.shape
((10, 8), (14825, 10), (272388, 6), (148431, 26))